Data for this example are downloaded from site: http://www.neuroepigenomics.org/methylomedb/download.html We used 4 control samples: Control1 AC, Control2 AC, Control3 AC and Control4 AC and 4 disease samples: SCZ1 AC, SCZ2 AC, SCZ3 AC, SCZ4 AC. After downloading we summed all results on the same position and chromosome separately from control and disease samples.
In this file we created analysis of DMR in chromosome 1. The code about preprocessing files can be found in examples/prep.MethylomeDB.R in this repo. Result of this script are two data.frames: examples/control.Rds and examples/disease.Rds which are used in this document.
Let’s analyze!
Read our data:
control <- readRDS('control.rds')
datatable(head(control))
disease <- readRDS('disease.rds')
datatable(head(disease))
Preprocess data:
data <- preprocessing(control, disease)
Now we have one data.frame with all necessary values binding on chromosome and position:
datatable(head(data))
Check correlation:
data %>% filter(prob == 'x') %>% mutate(a = log((meth + 1)/(unmeth + 1))) %>% pull(a) -> a
data %>% filter(prob == 'y') %>% mutate(b = log((meth + 1)/(unmeth + 1))) %>% pull(b) -> b
cor(a,b)
## [1] 0.8895702
So we see that these probes are highly correlated.
acf.met <- function(prob_i, n, title){
prob_i$m <- log((prob_i$meth+1)/(prob_i$unmeth+1))
wek <- numeric(max(prob_i$poz))
wek[prob_i$poz] = prob_i$m
wek[wek==0] = NA
acfG <- acf(wek, lag.max=n, na.action = na.pass)
autoplot(acfControl) + ggtitle(title) +
geom_vline(xintercept = 1000, color = 'red') +
geom_vline(xintercept = 2000, color = 'red') +
geom_vline(xintercept = 3000, color = 'red') +
geom_vline(xintercept = 4000, color = 'red') +
geom_vline(xintercept = 5000, color = 'red') + xlim(0, 5000) + theme_minimal() +
theme(title = element_text(size = 20), axis.title = element_text(size = 17),
axis.text.y = element_text(size = 16), axis.text.x = element_text(size = 16))
}
prob_i <- readRDS('disease.rds')
prob_i$m <- log((prob_i$meth+1)/(prob_i$unmeth+1))
wek <- numeric(max(prob_i$poz))
wek[prob_i$poz] = prob_i$m
wek[wek==0] = NA
acfG <- acf(wek, lag.max=5000, na.action = na.pass)
saveRDS(acfG, 'acfDisease.rds')
acf.met(control, 5000, 'Acf in control probe')
acf.met(disease, 5000, 'Acf in disease probe')
We also see that methylation is highly correlated on adjoining positions in the same probe. This situation is still visible between observerations that are distant over 1000 bp.
Now we can create region. After creation they will be tested if there is significant difference between two probes in methylation rate.
One way is to use create.tiles.fixed.length function, See ?create.tiles.fixed.length if you want some details.
data.tiles.1 <- create.tiles.fixed.length(data, tiles.length = 1000, common = F)
datatable(head(data.tiles.1))
data.tiles.2 <- create.tiles.fixed.length(data, tiles.length = 1000, common = T)
datatable(head(data.tiles.2))
We can can also use create.tiles.max.gap. See ?create.tiles.max.gap if you want some details.
data.tiles.3 <- create.tiles.max.gap(data, gaps.length = 100)
datatable(head(data.tiles.3))
Now we can compare how perform our functions.
stats.1 <- get.stats(data.tiles.1)
stats.2 <- get.stats(data.tiles.2)
stats.3 <- get.stats(data.tiles.3)
datatable(head(stats.3))
We got basic statistics about two probes. This is helpful if we want check coverage of created regions or methylation difference.
stats.all <- gdata::combine(stats.1, stats.2, stats.3)
ggplot(stats.all, aes(x = meth.diff, group = source, col = source)) + geom_density() + theme_minimal() +
ggtitle('Distibution of methylation difference') +
theme(
plot.title = element_text(size=20),
axis.title.x = element_text(size=18),
axis.title.y = element_text(size=18),
axis.text.x = element_text(size=14),
axis.text.y = element_text(size=14),
legend.text = element_text(size = 14),
legend.title = element_text(size = 16))
ggplot(stats.all, aes(x = meth.cov, group = source, col = source)) + geom_density() + theme_minimal() + xlim(0,100) +
ggtitle('Distibution of coverage regions') +
theme(
plot.title = element_text(size=20),
axis.title.x = element_text(size=18),
axis.title.y = element_text(size=18),
axis.text.x = element_text(size=14),
axis.text.y = element_text(size=14),
legend.text = element_text(size = 14),
legend.title = element_text(size = 16))
We also can join e.g stats.3 and data.tiles.3 on chromosome, start and end column and analyzing regions only if they coverage is greater than some specyfic values or other condition.
Get DMR basen on creation regions by create.tiles.max.gap function. See ?find.DMR if you want some details.
result <- find.DMR(data.tiles.3, methods = c('Wilcoxon', 'Ttest', 'KS', 'Reg.Log', 'Reg.Mixed', 'Reg.Corr.Mixed'))
Based on draw.methylation function we plotted ten best result from find.DMR.
top.n <- function(result, data, title){
p <- list()
for (i in 1:10){
p[[i]] <- draw.methylation(data, chromosom = 'chr1', start = pull(result[i, 'start']), end = pull(result[i, 'end']), bind.probes = F,
size.x.dot = 7, size.y.dot = 4, plot.title = 16, axis.title.x = 12, axis.title.y = 12, legend.title = 10, legend.text = 8,
axis.text.x = 9, axis.text.y = 9)
}
grid.arrange(arrangeGrob(p[[1]], p[[2]], ncol = 2),
arrangeGrob(p[[3]], p[[4]], ncol = 2),
arrangeGrob(p[[5]], p[[6]], ncol = 2),
arrangeGrob(p[[7]], p[[8]], ncol = 2),
arrangeGrob(p[[9]], p[[10]], ncol = 2),
nrow = 5, top=textGrob(title, gp=gpar(fontsize=25,font=8)))
}
top.n(result$Wilcoxon, data, 'Top results by Wilcoxon test')
top.n(result$Ttest, data, 'Top results by t-test')
top.n(result$KS, data, 'Top results by K-S test')
top.n(result$Reg.Log, data, 'Top results by logistic regression')
We can also sorting regions from logistic regression by effect.size:
result$Reg.Log.Beta <- result$Reg.Log %>% filter(p.value < 0.001) %>% arrange(-abs(beta.coef))
The outcome above is equivalent if we run:
result <- find.DMR(data.tiles.3, methods = c('Reg.Log'), p.value.log.reg = 0.001)
top.n(result$Reg.Log.Beta, data, 'Top results by logistic regression')
top.n(result$Reg.Mixed, data, 'Top results by log.reg with random effects')
result$Reg.Mixed.Beta <- result$Reg.Mixed %>% filter(p.value < 0.001) %>% arrange(-abs(beta.coef))
datatable(head(result$Reg.Mixed.Beta))
The outcome above is equivalent if we run:
result <- find.DMR(data.tiles.3, methods = c('Reg.Corr.Mixed'), p.value.log.reg = 0.001)
top.n(result$Reg.Mixed.Beta, data, 'Top results by log.reg with random effects sorting on beta')
top.n(result$Reg.Corr.Mixed, data, 'Top results by log.reg with random effects with given corr. matrix')
result$Reg.Corr.Mixed.Beta <- result$Reg.Corr.Mixed %>% filter(p.value < 0.001) %>% arrange(-abs(beta.coef))
datatable(head(result$Reg.Corr.Mixed.Beta))
The outcome above is equivalent if we run:
result <- find.DMR(data.tiles.3, methods = c('Reg.Corr.Mixed'), p.value.log.reg = 0.001)
top.n(result$Reg.Corr.Mixed.Beta, data, 'Top results by log.reg with random effects with given corr. matrix sorting on beta')
All result of function find.DMR estimated on data.tiles.3 can be found in examples/result.rds file.
get_top <- function(data, n, stats){
data %>% slice(1:n) %>%
left_join(stats, by = c('chr', 'start', 'end')) %>%
dplyr::select(quantile, meth.diff, meth.cov)
}
top <- do.call(gdata::combine,lapply(result, get_top, n = 100, stats = stats.3))
ggplot(top, aes(x = source, y = quantile)) + geom_boxplot(aes(fill = source)) + ggtitle('Distribution of rank rate by methods') +
scale_fill_brewer(palette = "Set1") + theme_minimal() + theme(title = element_text(size = 20), axis.title = element_text(size = 17),
axis.text.y = element_text(size = 16), axis.text.x = element_text(size = 16, angle = 20) , legend.position = "None") +
labs(x="Method",y="Rank rate") +
scale_x_discrete(limits=c('Reg.Mixed','Reg.Corr.Mixed.Beta','Reg.Corr.Mixed', 'KS', "Wilcoxon", 'Reg.Log.Beta', 'Reg.Mixed.Beta', 'Reg.Log', 'Ttest'))
ggplot(top, aes(x = source, y = meth.diff)) + geom_boxplot(aes(fill = source)) + ggtitle('Distribution of methylation absolute difference by methods') +
scale_fill_brewer(palette = "Set1") + theme_minimal() + theme(title = element_text(size = 20), axis.title = element_text(size = 17),
axis.text.y = element_text(size = 16), axis.text.x = element_text(size = 16, angle = 20) , legend.position = "None") +
labs(x="Method",y="Methylation difference") +
scale_x_discrete(limits=c('Reg.Mixed','Reg.Corr.Mixed.Beta','Reg.Corr.Mixed', 'KS', "Wilcoxon", 'Ttest', 'Reg.Log', 'Reg.Log.Beta', 'Reg.Mixed.Beta'))
ggplot(top, aes(x = source, y = meth.cov)) + geom_boxplot(aes(fill = source)) + ggtitle('Coverage distribution by methods') +
scale_fill_brewer(palette = "Set1") + theme_minimal() + theme(title = element_text(size = 20), axis.title = element_text(size = 17),
axis.text.y = element_text(size = 16), axis.text.x = element_text(size = 16, angle = 20) , legend.position = "None") +
labs(x="Method",y="Coverage") +
scale_x_discrete(limits=c('Reg.Corr.Mixed.Beta','Reg.Log.Beta','Reg.Mixed.Beta', 'Reg.Mixed', "KS", 'Reg.Log', 'Reg.Corr.Mixed', 'Ttest', 'Wilcoxon'))
We can show, that these method given very different results. The rank rate is very good if we use Ttest, Reg.Log, Reg.Mixed.Beta and Reg.Log.Beta methods. The biggest differences we see for methylation coverage. Wilcoxon test recommended regions that have the most observations ~ 200. Reg.Corr.Mixed.Beta and Reg.Log.Beta proposed smaller group - about 10 observations. If we check methylation difference, Reg.Log.Beta and Reg.Mixed.Beta present very large difference in recommended regions.